1 The condition of Tanzanian water access points

Water is life. This is especially true in places where it is sparse as in huge parts of Africa. To provide people with fresh water organizations build water pumps, but oftentimes they do not further pay for maintenance and they break down, becoming useless. The online platform Taarifa collects data of water pumps in Tanzania and wants to predict, which ones are broken or will soon break down to be able to organize maintenance. The data science competition platform drivendata.com hosts a challenge, where the community can help with the prediction effort (http://www.drivendata.org/competitions/7/page/23/). The data used in this report corresponds to the training data provided for the challenge.

2 Data Preparation

As a first step the variable given in the data set will be inspected:

##  [1] "amount_tsh"            "basin"                
##  [3] "construction_year"     "date_recorded"        
##  [5] "district_code"         "extraction_type"      
##  [7] "extraction_type_class" "extraction_type_group"
##  [9] "funder"                "gps_height"           
## [11] "id"                    "installer"            
## [13] "latitude"              "lga"                  
## [15] "longitude"             "management"           
## [17] "management_group"      "num_private"          
## [19] "payment"               "payment_type"         
## [21] "permit"                "population"           
## [23] "public_meeting"        "quality_group"        
## [25] "quantity"              "quantity_group"       
## [27] "recorded_by"           "region"               
## [29] "region_code"           "scheme_management"    
## [31] "scheme_name"           "source"               
## [33] "source_class"          "source_type"          
## [35] "status_group"          "subvillage"           
## [37] "ward"                  "water_quality"        
## [39] "waterpoint_type"       "waterpoint_type_group"
## [41] "wpt_name"

Some of the variable seem to represent the same data at different levels of exactness. As for example source, source_type and source_class:

That can also be seen, when plotting them against each other. e.g.:

## [1] "groundwater :"
## [1] "shallow well" "machine dbh"  "spring"       "hand dtw"    
## [1] ""
## [1] "surface :"
## [1] "rainwater harvesting" "river"                "lake"                
## [4] "dam"                 
## [1] ""
## [1] "unknown :"
## [1] "other"   "unknown"
## [1] ""

As can be seen in the plot source_class has less levels, but each level of source is just represented by a single level of source_class. Thus the data will be reduced by removing some of variables. To do this mostly the least detailed variables were chosen. But in case of source this would only leave 3 levels. This might be too little information. What can already be observed is that most water points collect water from groundwater sources and that different techniques are used for harvesting.

The wanterpoint_type variables are similar. Here only the factors ‘communal standpipe’ and ‘communal standpipe multiple’ were aggregated. Since no major differences between those two categories would be expected the less complex variant was kept in the data set. Communal standpipes are the most common waterpoint type followed by hand pumps. Other extraction types are comparably rare.

The more detailed extraction type variables give detailed mechanism trade names, which may in some cases be of interest, but for the purposes needed her only the classes will be taken into account, since those give the overall type of mechanism instead of specific products. The histogram shows that most water points use either gravity or manual power to pump up the water, compared to motorized systems.

In this case the decision which of the different variable depicting the same data to pick fell again to the less complex one, management group. Management and scheme_management are quite similar. Management is has only one less level. By far most water points are managed by communal management schemes, especially VWC (village water community).

For the payment variables only the naming of the factor levels is different. Thus it does not matter, which of the two variables is kept. Roughly half of the water points have no payment scheme. There are payment schemes with regular payments and others with payments on demand. For a significant amount of water points the payment mode is unknown.

The variables quantity and quantity_group are even completely the same. Most water points (about half) give enough water.

The two variables descibing the water quality are different in whether they differentiate between abandoned and not abandoned water points. In general whether a water point is abandoned is not a term of water quality. Thus it would make more sense to create a new variable, but since only a very small amount of water points are labeled as abandoned, the information gain might be little. Additionally since this is not already a measured category false negative labeling might be very probable for a lot of water points. Thus quality_group is kept for further analysis.

This step reduces the number of variables from 41 to 31. Further the variables ‘id’, ‘recorded_by’ and ‘num_private’, ‘wpt_name’ are associated to the data collection or just in naming the water point and will thus not contribute to the prediction.

The histograms as well do not show a high variety or they show a huge number of factor levels and thus that these variable probably do not add to a better prediction. Thus they will be removed from the data set as well. This leaves the following variables:

##  [1] "amount_tsh"            "basin"                
##  [3] "construction_year"     "date_recorded"        
##  [5] "district_code"         "extraction_type_class"
##  [7] "funder"                "gps_height"           
##  [9] "installer"             "latitude"             
## [11] "lga"                   "longitude"            
## [13] "management_group"      "payment_type"         
## [15] "permit"                "population"           
## [17] "public_meeting"        "quality_group"        
## [19] "quantity_group"        "region"               
## [21] "region_code"           "scheme_name"          
## [23] "source_type"           "status_group"         
## [25] "subvillage"            "ward"                 
## [27] "waterpoint_type_group"

Intuitively the location may be a good predictor for the functionality, as well as whether there is a fee or the age of the well. Naturally the water quantity also indicates functionality. Additional factors like public meetings or water quality may also improve the prediction, while the funder probably won’t influence the prediction a lot, especially because the variable contains a large amount of levels. Some variables have a huge amount of levels. This does not allow for creating a model and probably does not add a lot of information.

## [1] "date_recorded" "funder"        "installer"     "subvillage"   
## [5] "lga"           "ward"          "scheme_name"

The variables subvillage, lga, district_code, region_code and ward are like region and the coordinates regional descriptors, thus redundant and due to the high number of levels probably less helpful than the region variable. Thus they will be removed from the data set.

The date_recorded variable can be made variable of the date class:

The scheme_name variable will be removed.

The variables funder and installer may be of interest, since experience and determination may influence the status of the water points. A possible solution would be to make new variables that state the number of water points a company as funded or installed, thus transforming this categorical variable into a continuous numeric variable.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3 Variable Importance

The cleaned data set has a much reduced number of dimensions, but of the remaining variables not all will be similarly important for predicting the functionality of the water points. Intuitively the date at which the status was recorded won’t be as important as the water quantity given by the water point.

To get a more profound idea of the importance of the single variables a quick random forest model was created. The Mean Decrease Accuracy was then used as a measure of variable importance.

## 
## Call:
##  randomForest(formula = status_group ~ ., data = data_refined,      importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 18.31%
## Confusion matrix:
##                         functional functional needs repair non functional
## functional                   29114                     692           2453
## functional needs repair       2246                    1471            600
## non functional                4578                     305          17941
##                         class.error
## functional               0.09749217
## functional needs repair  0.65925411
## non functional           0.21394147
##      management_group         quality_group        public_meeting 
##           0.008092388           0.008244083           0.008730163 
##                permit       installer_count            population 
##           0.014781005           0.034301211           0.038517277 
##            amount_tsh          funder_count            gps_height 
##           0.041360119           0.050416835           0.051499478 
## waterpoint_type_group         date_recorded          payment_type 
##           0.056711447           0.057654982           0.057759814 
##           source_type                 basin     construction_year 
##           0.059698458           0.062127196           0.066286177 
## extraction_type_class              latitude             longitude 
##           0.075507679           0.083309073           0.089017176 
##                region        quantity_group 
##           0.098740070           0.107483520

The importance scores how that the variables water_quality, public_meeting and permit do not influence the model much and thus are probably not very interesting to look at. On the other hand the location of the water point and as expected the quantity of water it gives seem to be good predictors.

4 Exploration

As a first graphic the distribution of the different states the water points are in were plotted:

It seems like more than half of the pumps are still working, but still a lot are broken down.

4.0.1 Quantity

The variable with the highest importance score was quantity. That makes a lot of sense, since a broken water_point probably will give less water than a functioning one. To visualize this a contingency table, listing the water point status group against the water quantity. The values were normalized to the total amount of water points in one quantity-group. A heatmap was used to visualize the data:

## NULL

As expected dry wells are mostly non functional, while wells still giving water are mostly functional. Wells, where the quantity is unknown, are mostly non functional. Missing data might hint to a bad maintenance, which would further indicate, that this water points are broken.

4.0.2 Location

Some of the most important variables are the region and the coordinates of the water points. So first the regions were depicted by plotting them on a map of Tanzania.

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Tanzania&zoom=6&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Tanzania&sensor=false

Looking at the functionality of the water points and plotting them onto the map was more informative:

There seem to be some regions, where there are differences in the functionality of the water points in some regions. Especially in more remote areas there seems to be a higher fraction of non functional water points than in more urban areas. The region variable has a significantly higher importance than the coordinates itself. Thus there are probably con-founders behind the region variable. The different regions are most probably different in their economy, tourism and industry among other factors, which could influence the status of the wells. Those factors could be worth adding as variables in the data set. The influence of tourism might be seen when looking closer at Mount Kilimanjaro.

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Mount+Kilimanjaro&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Mount%20Kilimanjaro&sensor=false

The highest density of functional water points is in the area, where the main access points to the mountain are located. The highest density of non functional water points on the other hand is at the eastern side of the mountain. Thus tourism might be a reason to better maintain water pumps.

Mtwara is one of the bigger cities in Tanzania and a region with a high density of non functional water points. Having a closer look at the region one can notice that the functional water points are mainly in the city area, while most non functional water points can be found in the rural areas around the city.

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Mtwara&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Mtwara&sensor=false

## NULL

The regions Tabora, Mara, Rukwa, Mtwara and Lindi have the highest fraction of non functional water points, while most water points in the regions Manyara, Kilimanjaro, Shinyanga, Kigoma, Iringa and Arusha are functional. In the remaining regions the ratio is more balanced, but still tipped to the functional side.

4.0.3 Waterpoint Type

## NULL

The water point_type variable again shows, that missing values hint for mostly non functional pumps. The other water point types are mostly functional.

4.0.4 Extraction Type

## NULL

This heatmap clearly shows, that water points not dependent on motorization are more often functional. motorized pumps still work a lot of the time, but a comparably higher amount is broken. If the extraction type is unknown, the respective water points are most of the time non functional.

4.0.5 Construction Year

As one would expect older water points are more often non functional then newer ones. There seems to be a nice linear relationship between the construction year and the relative amount of functional water points. The number of water points that need repair is relatively steady, probably because it is a rather transient state.

4.0.6 Basin

## NULL

The basin variable is again a regional variable. The difference between the basins is not very big, but two basins have less functional water points than others: Ruvuma and Lake Rukwa.

4.0.7 Source

## NULL

The source variable shows that less artificial sources seem to be more often functional. The fact that dams seem to be the source in the worst condition is confusing, since the water point type shows the opposite trend for dams.

4.0.8 Payment

## NULL

Setting up some payment for the usage of the water point is clearly beneficial in terms of maintenance. Payed water points are functional in 60 - 75 %, while unpayed water points are broken down half of the time.

4.0.9 Date recorded

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The data were mostly recorded in 2011 and late 2013, 2014. While the density is quite similar for every outcome at the second peak, in early 2013 there

4.0.10 gps_height

At higher heights there is a higher density of functional water points, but overall the height probably does not help to distinguish functional and non functional water points.

4.0.11 funder_count

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

One would think that after funding many water points, a funding agency would improve the maintenance of their investments, but especially the water points funded by the Government of Tanzania, which funded the highest number of water points, are more frequently non functional than water points funded by less experienced organizations.

4.0.12 amount_tsh

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The total static head (tsh) is a rather technical measure. Most values are given as zero. Tsh is an indicator of how much overall water is still available at the water point. Since it is unlikely that so many water points do not have any water at all, especially ~50% are still working, 0 probably represents NA. It is noticeable that if the tsh value is given the fraction of functional water points is bigger. That this value was determined at all probably indicates the presence of an engineer and thus better maintenance. Thus the variable could be reduced to a logical variable indicating the availability of this measure.

## NULL

The reduced variable shows the same trend, as seen in the amount_tsh variable and thus may be as helpful for the model.

4.0.13 population

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It seems that in areas with a population of 10 to 100 people there is a higher fraction of functional water points than in areas with lower or higher population.

4.0.14 installer_count

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There seems to be no real influence of the installer ‘experience’ to whether the water points are functional.

4.0.15 permit

## NULL

Whether there is a permit for the water point, does not make a difference.

4.0.16 public_meeting

## NULL

Public meetings seem to have a very small positive impact.

4.0.17 quality_group

## NULL

Two values have a clear impact on the water point status. Again ‘unknown’ most of the time means a non functional water point. Fluoride in return seems to improve the status of the wells quite a lot. Although since it serves as an inorganic antibiotic, this is probably not due to technical improvements and the effect may actually be caused by a con-founder. Thus fluoride treatment may hint for maintenance in general.

4.0.18 management

## NULL

The management of the water points seems to have an overall small impact on functionality. ‘Unknown’ again means the highest fraction of non-functional water points and commercially managed water points seem to have the highest fraction of functional water points. This is not unexpected since there is commercial interest behind it. The payment variable might very well be a con-founder of this variable.

4.1 Using NAs as a predictor

One observation that was made during the analysis of the given variables was, that NAs or values representing it most of the time corresponded with a higher fraction of non functional water points. Thus here a new variable will be created giving the number of NAs of the respective entry. As a first step NAs have to be identified and named accordingly. Some of the changes already made will be reversed: namely removing installer, funder, amount_tsh. This is done, since the variable created from them do not represent the NAs in a suitable way for this analysis. Since longitude and latitude are both only part of a coordinate, if one is an NA the other will be as well. To not overrepresent the NAs, the longitude variable will be removed here,

Then the different values representing NAs are transformed to NA. In numeric values the NA-value is 0 or -2e-08. This can be problematic, since 0 could actually be the measured value. For this analysis it will be assumed that this is not the case. Since 0 is the stroongest represented value in numerical variables it is very likely that it is meant as NA. In categorical or logical variables it can be ‘0’, ‘-’, ‘other’, ‘Other’, ‘Others’, ‘’, ’Unknown’ or ‘unknown’. There are probably other values representing unknown, which were not identified, but they should represent a minority of NAs, thus should not have a strong impact on the analysis.

The total number of NAs is 147695 of 1366200, which is 10.8106427 %. That is quite a lot. Next a new variable is created containing the number of NAs of the respective entry.

With a higher amount of NAs for an entry the trend goes to a decreased fraction of functional water points. The error of the function gets bigger at higher counts of NAs, reducing the goodness of the fit, but this is probably due to the lower number of data entries having this high number of NAs.

Using this slightly further refined data set, a new model was build and compared with the first model:

## [1] "Model 1"
## 
## Call:
##  randomForest(formula = status_group ~ ., data = data_refined,      importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 18.31%
## Confusion matrix:
##                         functional functional needs repair non functional
## functional                   29114                     692           2453
## functional needs repair       2246                    1471            600
## non functional                4578                     305          17941
##                         class.error
## functional               0.09749217
## functional needs repair  0.65925411
## non functional           0.21394147
## [1] "Model 2"
## 
## Call:
##  randomForest(formula = status_group ~ ., data = data_refined) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 18.4%
## Confusion matrix:
##                         functional functional needs repair non functional
## functional                   29164                     655           2440
## functional needs repair       2298                    1419            600
## non functional                4631                     303          17890
##                         class.error
## functional               0.09594222
## functional needs repair  0.67129951
## non functional           0.21617596

The out of box error rate does not seem to be different. But when predicting the test set given in the competition and using drivendata’s scoring algorithm the new model scores slightly better. (Model 1: 0.8114; Model 2: 0.8146). Since random forest models are not deterministic this could also just be caused by the stochastic component of the model building process.

5 Analysis of spatial distribution of different variables

For further analysis the two example map regions already used before are now more closely looked at by using color coding to label the water points for their values at some variables.

## [1] "Mount Kilimanjaro"
## converting bounding box to center/zoom specification. (experimental)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=-3.111206,37.295808&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false

The area with the highest density of non functional water points does have a higher density of water points giving insufficient water and older water points. The extraction type is mostly ‘gravity’ and is as the payment type and water point type not a good predictor in this region. While the water point type per se also seems to be a bad predictor, the NA-equivalent ‘other’ can be mostly found at broken water points.

## [1] "Mtwara"
## converting bounding box to center/zoom specification. (experimental)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=-10.255531,40.022676&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false

The area around Mtwara is quite different. The non functional water points are more often dry or give insufficient water. Interestingly the dry water points cluster in the city area of Mtwara. An unknown extraction type hints a broken water point, while extraction using submersibles seems to be working well in this region. Standpipes seem to be broken less of the time, while an unknown water point type seems again to be a good indicator for broken water points. The construction year is less informative in this region. Unknown payment type and no payment at all negatively influences the status of water points. In general per bucket payment seems to work relatively well.

It is overall interesting that those two region show differences in the importance of the predictors. Further some variables show clusters of water points with the same value, which strengthens the importance of spatial variables.

6 Final Plots and Summary

6.0.1 Plot One

6.0.2 Description One

The frequency of water points in their respective states plotted against the number of NAs per entry. The plot clearly shows that the number of NAs negatively correlate with the frequency of functional water points.

6.0.3 Plot Two

6.0.4 Description Two

Plot of status frequency against the respective construction years. There is a clear relationship between functionality and age of the water point, showing a negative correlation between functional water points and their age.

6.0.5 Plot Three

6.0.6 Description Three

Map of the area around Mount Kilimanjaro with the water points. The color represents the quantity of water given by the respective water point. The maps show different distributions of functional and not functional water points. Thereby functional water points are mostly located in regions with better access and higher tourism and vice versa. There are clusters of water points giving the same quantity of water. The area containing more non functional water points also contain more dry water points, which in term tend to be older.


7 Summary and Reflection

The data set used in this exploration is is very untidy. A lot of variables are redundant providing the same information at different accuracies. I tried to make a compromise in choosing the variable to keep with the aim to keep as much information as possible, while keeping the data set as clean and non redundant in its values as possible. Some variables have a huge amount of levels. That makes prediction using them impossible. For those variables representing companies, I tried to describe them by their experience, by substituting the variable with the respective counts of water points they worked on. This gave me continuous variables to work with, but actually those variables did not add much information to the model. The other variables with too high amount of levels are regional descriptors. Since the coordinates and region names already provide information of this category, those variables were removed from the data set. The analysis showed that the geographic location of the water point is very important for the prediction of the water points status’. There are several issues with this fact. The political regions are of course not the direct cause of the water points status’, but they stand for several con-founders, like tourism or wealth. Using such data would probably improve the model quite a bit. Even more when using the data for lower level political regions, since this provides a higher geospatial resolution. This observation is further strengthened by the observed clustering of levels for some variables as well as the difference between the two analyzed example areas. The model build for this report is able to predict more than 80 % of the test set data provided by the competition correctly. This already quite helpful for directing maintenance efforts. But it is quite impossible to predict water points that are in the progress of breaking down and need repair. For improving the model further I strongly suggest collecting further additional data that represent the con-founders behind the geographic location, which should be relatively easy to get from governmental agencies in case of wealth, tourism etc..